Evaluation: Multiclass Classification

Multiclass classification likewise has an error rate and an accuracy

$$ \begin{align*} \quad E_D (f) = \frac{1}{m} \sum_{i \in [m]} \Ibb (y_i \ne f(\xv_i)), \quad \text{Acc}_D (f) = 1 - E_D (f) \end{align*} $$
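These two quantities translate directly into code; the label vectors below are hypothetical examples, not from the lecture:

```python
import numpy as np

# Hypothetical true labels and predictions for illustration
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 2, 2, 2, 1, 1])

error_rate = np.mean(y_true != y_pred)  # E_D(f): fraction of mispredictions
accuracy = 1.0 - error_rate             # Acc_D(f)
print(error_rate, accuracy)
```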

as well as a confusion matrix

           Predicted $1$   Predicted $2$   ...   Predicted $c$
True $1$
True $2$
...
True $c$
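A sketch of building such a confusion matrix with scikit-learn, again on small hypothetical label vectors:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]

# Rows index the true class, columns the predicted class
cm = confusion_matrix(y_true, y_pred)
print(cm)
```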
Evaluation: Cross-Entropy Loss

The error rate (0-1 loss) is discontinuous and hard to optimize, so the cross-entropy loss is usually adopted instead

Let the prediction functions for the $c$ classes be $f_1, \ldots, f_c$; the prediction for a sample $x$ is then

$$ \begin{align*} \quad \pv = \left[ \frac{e^{f_1(x)}}{\sum_{j \in [c]} e^{f_j(x)}}, \frac{e^{f_2(x)}}{\sum_{j \in [c]} e^{f_j(x)}}, \ldots, \frac{e^{f_c(x)}}{\sum_{j \in [c]} e^{f_j(x)}} \right] \quad \longleftarrow \text{softmax} \end{align*} $$

This is a $c$-dimensional vector, and also a discrete probability distribution
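A minimal softmax sketch, assuming the raw scores $f_1(x), \ldots, f_c(x)$ are given as a NumPy array:

```python
import numpy as np

def softmax(scores):
    # Subtracting the max does not change the result but avoids overflow
    z = np.exp(scores - np.max(scores))
    return z / z.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # entries are positive and sum to 1
```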

The class label $y$ can be converted into a one-hot encoding $\ev_y$, which is also a $c$-dimensional discrete probability distribution

Requirements for the surrogate loss: it should be continuous in $\pv$ and $\ev_y$, and the closer $\pv$ is to $\ev_y$, the smaller the loss should be

Question: given a discrete probability distribution $\qv$, how do we measure the distance from a distribution $\pv$ to it?


Cross-entropy: $H_{\qv} (\pv) \triangleq - \sum_i q_i \ln p_i$

The cross-entropy is minimized when $\pv = \qv$, in which case $H_{\qv} (\pv)$ equals the entropy $H(\qv)$ of the distribution $\qv$

$$ \begin{align*} \quad \min_{\pv} H_{\qv} (\pv) = - \sum_i q_i \ln p_i, \quad \st ~ \sum_i p_i = 1 \end{align*} $$

The Lagrangian is $L(p_i, \alpha) = - \sum_i q_i \ln p_i + \alpha (\sum_i p_i - 1)$, hence

$$ \begin{align*} \quad \nabla_{p_i} L(p_i, \alpha) & = - \frac{q_i}{p_i} + \alpha = 0 \Longrightarrow q_i = \alpha p_i \\ & \Longrightarrow \sum_i q_i = \alpha \sum_i p_i \Longrightarrow \alpha = 1 \Longrightarrow \pv = \qv \end{align*} $$
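A quick numeric check of this result, with an arbitrarily chosen $\qv$ (the specific distributions are illustrative):

```python
import numpy as np

def cross_entropy(q, p):
    # H_q(p) = -sum_i q_i ln p_i
    return -np.sum(q * np.log(p))

q = np.array([0.5, 0.3, 0.2])
h_min = cross_entropy(q, q)                        # equals the entropy H(q)
h_other = cross_entropy(q, np.array([0.4, 0.4, 0.2]))
print(h_min, h_other)  # any p != q yields a strictly larger value
```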


For $(x,y)$ with $y \in [c]$, the cross-entropy loss is $- \ln \frac{e^{f_y(x)}}{\sum_{j \in [c]} e^{f_j(x)}}$

For $(x,y)$ with $y \in \{1, -1\}$ and $\qv = [(1+y)/2; (1-y)/2]$, the cross-entropy loss is

$$ \begin{align*} \quad \text{CE} & = - \frac{1+y}{2} \ln \frac{e^{f_1(x)}}{e^{f_1(x)}+e^{f_2(x)}} - \frac{1-y}{2} \ln \frac{e^{f_2(x)}}{e^{f_1(x)}+e^{f_2(x)}} \\ & = - \frac{1+y}{2} \ln \frac{e^{f_1(x)-f_2(x)}}{e^{f_1(x)-f_2(x)}+1} - \frac{1-y}{2} \ln \frac{1}{e^{f_1(x)-f_2(x)}+1} \\ & = - \frac{1+y}{2} \ln \frac{e^{w(x)}}{e^{w(x)}+1} - \frac{1-y}{2} \ln \frac{1}{e^{w(x)}+1} \quad \leftarrow w(x) \triangleq f_1(x)-f_2(x) \\ & = \begin{cases} \ln (1 + e^{-w(x)}), & y = 1 \\ \ln (1 + e^{w(x)}), & y = -1 \end{cases} \\ & = \ln (1 + e^{- y w(x)}) \end{align*} $$

This shows that the multiclass cross-entropy loss is an extension of the binary logistic loss
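The reduction above can be verified numerically; the scores $f_1, f_2$ below are arbitrary illustrative values:

```python
import numpy as np

# For y in {1, -1} with scores f1, f2, the two-class cross-entropy
# should equal the logistic loss ln(1 + exp(-y * (f1 - f2)))
f1, f2, y = 1.3, -0.4, 1
p = np.exp([f1, f2]) / np.exp([f1, f2]).sum()  # softmax over the two scores
q = np.array([(1 + y) / 2, (1 - y) / 2])       # one-hot target
ce = -np.sum(q * np.log(p))
logistic = np.log1p(np.exp(-y * (f1 - f2)))
print(ce, logistic)  # the two values coincide
```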

General Machine Learning Workflow

[Pipeline diagram] Raw data → Feature extraction → Feature processing → Feature transformation → Model learning → Prediction (extraction, processing, and transformation together form feature engineering)

Raw data: tables, images, videos, text, speech, ...

Model learning: the core part, learning a mapping used for prediction

Feature engineering:

  • Extraction: select or construct potential features useful for the target task
  • Processing: unordered discrete categorical features → numeric features; missing-value handling; standardization
  • Transformation: select or map features to obtain features more effective for the target task
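A minimal sketch of these feature-processing steps with scikit-learn; the column names and transformer choices are illustrative assumptions, not from the lecture:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data with one categorical and one numeric column
raw = pd.DataFrame({
    "color": ["red", "blue", np.nan, "red"],  # unordered categorical, has a missing value
    "size": [1.0, 2.5, 3.0, np.nan],          # numeric, has a missing value
})

prep = ColumnTransformer([
    # categorical: impute the most frequent value, then one-hot encode
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder())]), ["color"]),
    # numeric: impute the mean, then standardize
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), ["size"]),
])

X = prep.fit_transform(raw)
print(X.shape)  # 2 one-hot columns + 1 standardized numeric column
```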

Binary Classification Example

from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()
print(breast_cancer.DESCR)
# --------------------
# **Data Set Characteristics:**
# 
# :Number of Instances: 569
# 
# :Number of Attributes: 30 numeric, predictive attributes and the class
# 
# :Attribute Information:
#     - radius (mean of distances from center to points on the perimeter)
#     - texture (standard deviation of gray-scale values)
#     - perimeter
#     - area
#     - smoothness (local variation in radius lengths)
#     - compactness (perimeter^2 / area - 1.0)
#     - concavity (severity of concave portions of the contour)
#     - concave points (number of concave portions of the contour)
#     - symmetry
#     - fractal dimension ("coastline approximation" - 1)
# 
#     The mean, standard error, and "worst" or largest (mean of the three
#     worst/largest values) of these features were computed for each image,
#     resulting in 30 features.  For instance, field 0 is Mean Radius, field
#     10 is Radius SE, field 20 is Worst Radius.
# 
#     - class:
#             - WDBC-Malignant
#             - WDBC-Benign
# 
# :Summary Statistics:
# 
# ===================================== ====== ======
#                                         Min    Max
# ===================================== ====== ======
# radius (mean):                        6.981  28.11
# texture (mean):                       9.71   39.28
# perimeter (mean):                     43.79  188.5
# area (mean):                          143.5  2501.0
# smoothness (mean):                    0.053  0.163
# compactness (mean):                   0.019  0.345
# concavity (mean):                     0.0    0.427
# concave points (mean):                0.0    0.201
# symmetry (mean):                      0.106  0.304
# fractal dimension (mean):             0.05   0.097
# radius (standard error):              0.112  2.873
# texture (standard error):             0.36   4.885
# perimeter (standard error):           0.757  21.98
# area (standard error):                6.802  542.2
# smoothness (standard error):          0.002  0.031
# compactness (standard error):         0.002  0.135
# concavity (standard error):           0.0    0.396
# concave points (standard error):      0.0    0.053
# symmetry (standard error):            0.008  0.079
# fractal dimension (standard error):   0.001  0.03
# radius (worst):                       7.93   36.04
# texture (worst):                      12.02  49.54
# perimeter (worst):                    50.41  251.2
# area (worst):                         185.2  4254.0
# smoothness (worst):                   0.071  0.223
# compactness (worst):                  0.027  1.058
# concavity (worst):                    0.0    1.252
# concave points (worst):               0.0    0.291
# symmetry (worst):                     0.156  0.664
# fractal dimension (worst):            0.055  0.208
# ===================================== ====== ======
# 
# :Missing Attribute Values: None
# 
# :Class Distribution: 212 - Malignant, 357 - Benign
# 
# :Creator:  Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian
# 
# :Donor: Nick Street
# 
# :Date: November, 1995
# 
# This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
# https://goo.gl/U2Uwz2
# 
# 
# 
# Separating plane described above was obtained using
# Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
# Construction Via Linear Programming." Proceedings of the 4th
# Midwest Artificial Intelligence and Cognitive Science Society,
# pp. 97-101, 1992], a classification method which uses linear
# programming to construct a decision tree.  Relevant features
# were selected using an exhaustive search in the space of 1-4
# features and 1-3 separating planes.
# 
# The actual linear program used to obtain the separating plane
# in the 3-dimensional space is that described in:
# [K. P. Bennett and O. L. Mangasarian: "Robust Linear
# Programming Discrimination of Two Linearly Inseparable Sets",
# Optimization Methods and Software 1, 1992, 23-34].
# 
# This database is also available through the UW CS ftp server:
# 
# ftp ftp.cs.wisc.edu
# cd math-prog/cpo-dataset/machine-learn/WDBC/
# 
# .. dropdown:: References
# 
#   - W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction
#     for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on
#     Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
#     San Jose, CA, 1993.
#   - O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and
#     prognosis via linear programming. Operations Research, 43(4), pages 570-577,
#     July-August 1995.
#   - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
#     to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994)
#     163-171.
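One possible continuation of this example; the classifier choice, scaling step, and split seed are assumptions for illustration, not prescribed by the original:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize features, then fit a logistic regression classifier
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)  # accuracy on the held-out split
print(acc)
```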

Confusion Matrix